Abstract: We study a high-dimensional generalized linear model and penalized empirical risk minimization with $\ell_1$ penalty. Our aim is to provide a non-trivial illustration that non-asymptotic bounds for the estimator can be obtained without relying on the chaining technique and/or the peeling device.



1. Introduction

We study an increment bound for the empirical process, indexed by linear combinations of highly correlated base functions. We use direct arguments instead of the chaining technique. Moreover, we obtain bounds for an M-estimation problem by inserting a convexity argument instead of the peeling device. Combining the two results leads to non-asymptotic bounds with explicit constants.

Let us motivate our approach. In M-estimation, an empirical average indexed by a parameter is minimized; this is often also called empirical risk minimization. To study the theoretical properties of the resulting estimator, the theory of empirical processes has been a successful tool. Indeed, empirical process theory studies the convergence of averages to expectations, uniformly over some parameter set. Some of the techniques involved are the chaining technique (see e.g. [13]), which relates increments of the empirical process to the entropy of the parameter space, and the "peeling device" (a terminology from [13]), which goes back to [?] and which allows one to handle weighted empirical processes. Also the concentration inequalities (see e.g. [?]), which consider the concentration of the supremum of the empirical process around its mean, are extremely useful in M-estimation problems.

A more recent trend is to derive non-asymptotic bounds for M-estimators. The papers [?] and [?] provide concentration inequalities with economical constants. This leads to good non-asymptotic bounds in certain cases. Generally, however, both the chaining technique and the peeling device may lead to large constants in the bounds. For an example, see the remark following (5).

Our aim in this paper is simply to avoid the chaining technique and the peeling device. Our results should primarily be seen as a non-trivial illustration that both techniques may be dispensable, leaving possible improvements for future research. In particular, we will at this stage not try to optimize the constants, i.e. we will make some arbitrary choices. Moreover, as we shall see, our bound for the increment involves an additional log-factor, $\log m$, where $m$ is the number of base functions (see below).

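The M-estimation problem we consider is for a high-dimensional generalized linear model. Let $Y \in \mathcal{Y} \subset \mathbb{R}$ be a real-valued (response) variable and $x$ be a covariate with values in some space $\mathcal{X}$. Let

$$\mathcal{F} := \left\{ f_\theta(\cdot) = \sum_{k=1}^m \theta_k \psi_k(\cdot),\ \theta \in \Theta \right\}$$

be a (subset of a) linear space of functions on $\mathcal{X}$. We let $\Theta$ be a convex subset of $\mathbb{R}^m$, possibly $\Theta = \mathbb{R}^m$. The functions $\{\psi_k\}_{k=1}^m$ form a given system of real-valued base functions on $\mathcal{X}$. The number of base functions, $m$, is allowed to be large. However, we do have the situation $m < n$ in mind (as we will consider the case of fixed design).

Let $\gamma_f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be some loss function, and let $\{(x_i, Y_i)\}_{i=1}^n$ be observations in $\mathcal{X} \times \mathcal{Y}$. We consider the estimator with $\ell_1$ penalty

$$\hat\theta_n := \arg\min_{\theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^n \gamma_{f_\theta}(x_i, Y_i) + \lambda_n^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}}(\theta) \right\}, \tag{1}$$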

where

$$I(\theta) := \sum_{k=1}^m |\theta_k| \tag{2}$$

denotes the $\ell_1$ norm of the vector $\theta \in \mathbb{R}^m$. The smoothing parameter $\lambda_n$ controls the amount of complexity regularization, and the parameter $s$ ($0 < s < 1$) is governed by the choice of the base functions (see Assumption B below). Note that for a properly chosen constant $C$ depending on $\lambda_n$ and $s$, we have for any $I > 0$,

$$\lambda_n^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}} = \min_{\lambda > 0} \left\{ \lambda I + \frac{C}{\lambda^{2(1-s)/s}} \right\}.$$

In other words, the penalty $\lambda_n^{2/(2-s)} I^{2(1-s)/(2-s)}(\theta)$ can be seen as the usual Lasso penalty $\lambda I(\theta)$ with an additional penalty on $\lambda$. The choice of the latter is such that adaptation to small values of $I(\theta_n^*)$ is achieved. Here, $\theta_n^*$ is the target, defined in (3) below.
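The identity between the power penalty and the minimized Lasso-type penalty can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and the explicit value of the constant $C$ is not given in the text but obtained here by elementary calculus (setting the derivative in $\lambda$ to zero), with $a = 2(1-s)/s$.

```python
import math

def tradeoff_exponent(s):
    """Exponent a = 2(1 - s)/s in the extra penalty C / lambda**a on lambda."""
    return 2.0 * (1.0 - s) / s

def constant_for(lam_n, s):
    """Constant C (derived by calculus, an assumption of this sketch) for which
    min over lambda of lambda*I + C/lambda**a equals the power penalty."""
    a = tradeoff_exponent(s)
    return (lam_n ** (2.0 / (2.0 - s)) * a ** (a / (1.0 + a)) / (1.0 + a)) ** (1.0 + a)

def minimized_lasso_penalty(I, C, s):
    """Minimum over lambda > 0 of lambda*I + C/lambda**a, via the stationary point."""
    a = tradeoff_exponent(s)
    lam_star = (a * C / I) ** (1.0 / (1.0 + a))  # solves I = a*C / lambda**(a+1)
    return lam_star * I + C / lam_star ** a

def power_penalty(I, lam_n, s):
    """lam_n**(2/(2-s)) * I**(2(1-s)/(2-s)), the penalty used in the estimator."""
    return lam_n ** (2.0 / (2.0 - s)) * I ** (2.0 * (1.0 - s) / (2.0 - s))
```

For instance, with `C = constant_for(0.1, 0.5)`, the values `minimized_lasso_penalty(2.0, C, 0.5)` and `power_penalty(2.0, 0.1, 0.5)` should agree up to rounding.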

The loss function $\gamma_f$ is assumed to be convex and Lipschitz (see Assumption L below). Examples are the loss functions used in quantile regression, logistic regression, etc. The quadratic loss function $\gamma_f(x,y) = (y - f(x))^2$ can be studied as well without additional technical problems. The bounds then depend on the tail behavior of the errors.
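To make the Lipschitz requirement concrete, the following sketch evaluates two of the losses mentioned above and estimates their Lipschitz ratio in the first argument by random probing. The parametrizations (check loss with level `tau`, logistic loss with labels in $\{-1, +1\}$) are common textbook forms and are assumptions of this illustration, not definitions taken from the paper.

```python
import math
import random

def quantile_loss(f, y, tau=0.5):
    """Check loss rho_tau(y - f); Lipschitz in f with constant max(tau, 1 - tau)."""
    r = y - f
    return r * (tau - (1.0 if r < 0 else 0.0))

def logistic_loss(f, y):
    """log(1 + exp(-y f)) for y in {-1, +1}, computed in a numerically stable way."""
    z = -y * f
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def max_lipschitz_ratio(loss, ys, trials=2000, seed=0):
    """Largest observed |loss(f1, y) - loss(f2, y)| / |f1 - f2| over random probes."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        f1, f2 = rng.uniform(-5, 5), rng.uniform(-5, 5)
        y = rng.choice(ys)
        if f1 != f2:
            worst = max(worst, abs(loss(f1, y) - loss(f2, y)) / abs(f1 - f2))
    return worst
```

Both ratios stay below 1 in such probes, which is the normalization used in Assumption L; the quadratic loss would not pass this check on an unbounded range.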

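The covariates $x_1, \ldots, x_n$ are assumed to be fixed, i.e., we consider the case of fixed design. For $\gamma : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, use the notation

$$P\gamma := \frac{1}{n} \sum_{i=1}^n E\gamma(x_i, Y_i).$$

Our target function $\theta_n^*$ is defined as

$$\theta_n^* := \arg\min_{\theta \in \Theta} P\gamma_{f_\theta}. \tag{3}$$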
When the target is sparse, i.e., when only a few of the coefficients $\theta_{n,k}^*$ are nonzero, it makes sense to try to prove that the estimator $\hat\theta_n$ is sparse as well. Non-asymptotic bounds for this case (albeit with random design) are studied in [12]. It is assumed there that the base functions $\{\psi_k\}$ have a design matrix with eigenvalues bounded away from zero (or at least that the base functions corresponding to the non-zero coefficients in $\theta_n^*$ have this property). In the present paper, the base functions are allowed to be highly correlated. We will consider the case where they form a VC class, or more generally, have $\epsilon$-covering number which is polynomial in $1/\epsilon$. This means that a certain smoothness is imposed a priori, and that sparseness is less of an issue.

We use the following notation. The empirical distribution based on the sample $\{(x_i, Y_i)\}_{i=1}^n$ is denoted by $P_n$, and the empirical distribution of the covariates $\{x_i\}_{i=1}^n$ is written as $Q_n$. The $L_2(Q_n)$ norm is written as $\|\cdot\|_n$. Moreover, $\|\cdot\|_\infty$ denotes the sup norm (which in our case may be understood as $\|f\|_\infty = \max_{1 \le i \le n} |f(x_i)|$, for a function $f$ on $\mathcal{X}$).

We impose four basic assumptions: Assumptions L, M, A and B. 

Assumption L. The loss function $\gamma_f$ is of the form $\gamma_f(x,y) = \gamma(f(x), y)$, where $\gamma(\cdot, y)$ is convex for all $y \in \mathcal{Y}$. Moreover, it satisfies the Lipschitz property

$$|\gamma(f_\theta(x), y) - \gamma(f_{\tilde\theta}(x), y)| \le |f_\theta(x) - f_{\tilde\theta}(x)|, \quad \forall\, (x,y) \in \mathcal{X} \times \mathcal{Y},\ \forall\, \theta, \tilde\theta \in \Theta.$$

Assumption M. There exists a non-decreasing function $\sigma(\cdot)$, such that for all $M > 0$ and all $\theta \in \Theta$ with $\|f_\theta - f_{\theta_n^*}\|_\infty \le M$, one has

$$P(\gamma_{f_\theta} - \gamma_{f_{\theta_n^*}}) \ge \|f_\theta - f_{\theta_n^*}\|_n^2 / \sigma^2(M).$$

Assumption M thus assumes quadratic margin behavior. In [?], more general margin behavior is allowed, and the choice of the smoothing parameter does not depend on the margin behavior. However, in the setup of the present paper, the choice of the smoothing parameter does depend on the margin behavior.

Assumption A. It holds that

$$\|\psi_k\|_\infty \le 1, \quad 1 \le k \le m.$$

Assumption B. For some constant $A \ge 1$, and for $V = 2/s - 2$, it holds that

$$N(\epsilon, \Psi) \le A \epsilon^{-V}, \quad \forall\, \epsilon > 0.$$

Here $N(\epsilon, \Psi)$ denotes the $\epsilon$-covering number of $(\Psi, \|\cdot\|_n)$, with $\Psi := \{\psi_k\}_{k=1}^m$.
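As an illustration of the covering number in Assumption B, the sketch below builds a greedy $\epsilon$-cover of a system of base functions in the empirical norm $\|\cdot\|_n$. The size of any such cover is an upper bound for $N(\epsilon, \Psi)$. The step-function system and all names are hypothetical choices for this example; functions are represented by their values at the design points.

```python
import math

def empirical_norm(values):
    """L2(Q_n) norm of a function given by its values at x_1, ..., x_n."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def greedy_cover(functions, eps):
    """Indices of centers such that every function lies within eps (in empirical
    norm) of some center. The number of centers upper-bounds N(eps, Psi)."""
    centers = []
    for k, psi in enumerate(functions):
        covered = any(
            empirical_norm([a - b for a, b in zip(psi, functions[c])]) <= eps
            for c in centers
        )
        if not covered:
            centers.append(k)
    return centers

def step_functions(n, m):
    """Hypothetical highly correlated system: psi_k(x) = 1{x >= k/m} on x_i = i/n."""
    return [[1.0 if (i / n) >= (k / m) else 0.0 for i in range(1, n + 1)]
            for k in range(1, m + 1)]
```

For example, `len(greedy_cover(step_functions(100, 50), 0.25))` is much smaller than the number of functions, reflecting that neighbouring step functions are close in $\|\cdot\|_n$.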

The paper is organized as follows. Section 2 presents a bound for the increments of the empirical process. Section 3 takes such a bound for granted and presents a non-asymptotic bound for $\|f_{\hat\theta_n} - f_{\theta_n^*}\|_n$ and $I(\hat\theta_n)$. The two sections can be read independently. In particular, any improvement of the bound obtained in Section 2 can be directly inserted in the result of Section 3. The proofs, which are perhaps the most interesting part of the paper, are given in Section 4.

2. Increments of the empirical process indexed by a subset of a linear space

Let $\varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. random variables, taking values $\pm 1$ each with probability $1/2$. Such a sequence is called a Rademacher sequence. Consider for $\epsilon > 0$ and $M > 0$, the quantity

$$Z^{\varepsilon}_{\epsilon,M} := \sup_{\|f_\theta\|_n \le \epsilon,\ I(\theta) \le M} \left| \frac{1}{n} \sum_{i=1}^n f_\theta(x_i) \varepsilon_i \right|.$$

We need a bound for the mean $E Z^{\varepsilon}_{\epsilon,M}$, because this quantity will occur in the concentration inequality (Theorem 4.1). In [?], the following trivial bound is used:

$$E Z^{\varepsilon}_{\epsilon,M} \le M\, E\left( \max_{1 \le k \le m} \left| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \psi_k(x_i) \right| \right).$$

On the right-hand side, one now has the mean of the maximum over finitely many functions, which is easily handled (see for example Lemma 4.1). However, when the base functions $\psi_k$ are highly correlated, this bound is too rough. We therefore need to proceed differently.

Let $\mathrm{conv}(\Psi) = \{ f_\theta = \sum_{k=1}^m \theta_k \psi_k : \theta_k \ge 0,\ \sum_{k=1}^m \theta_k = 1 \}$ be the convex hull of $\Psi$.

Recall that $s = 2/(2 + V)$, where $V$ is from Assumption B. From e.g. [10], Lemma 3.2, it can be derived that for some constant $C$, and for all $\epsilon > 0$,

$$E\left( \max_{f \in \mathrm{conv}(\Psi),\ \|f\|_n \le \epsilon} \left| \frac{1}{n} \sum_{i=1}^n f(x_i) \varepsilon_i \right| \right) \le C \epsilon^s \frac{1}{\sqrt{n}}. \tag{4}$$

The result follows from the chaining technique, and applying the entropy bound

$$\log N(\epsilon, \mathrm{conv}(\Psi)) \le A_0 \epsilon^{-2(1-s)}, \quad \epsilon > 0, \tag{5}$$

which is derived in [?]. Here, $A_0$ is a constant depending on $V$ and $A$.

Remark. It may be verified that the constant $C$ in (4) is then at least proportional to $1/s$, i.e., it is large when $s$ is small.

Our aim is now to obtain a bound from direct calculations. Pollard ([8]) presents the bound

$$\log N(\epsilon, \mathrm{conv}(\Psi)) \le A_1 \epsilon^{-2(1-s)} \log \frac{1}{\epsilon}, \quad \epsilon > 0,$$

where $A_1$ is another constant depending on $V$ and $A$. In other words, Pollard's bound has an additional log-factor. On the other hand, we found Pollard's proof a good starting point in our attempt to derive the increments directly, without chaining. This is one of the reasons why our direct bound below has an additional $\log m$ factor. Thus, our result should primarily be seen as an illustration that direct calculations are possible.

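Theorem 2.1. For $\epsilon \ge 16/m$ and $m \ge 4$, we have

$$E\left( \max_{f \in \mathrm{conv}(\Psi),\ \|f\|_n \le \epsilon} \left| \frac{1}{n} \sum_{i=1}^n f(x_i) \varepsilon_i \right| \right) \le 20 \sqrt{1 + 2A}\, \epsilon^s \sqrt{\frac{\log(6m)}{n}}.$$

Clearly the set $\{\sum_{k=1}^m \theta_k \psi_k : I(\theta) \le 1\}$ is the convex hull of $\{\pm\psi_k\}_{k=1}^m$. Using a renormalization argument, one arrives at the following corollary.

Corollary 2.1. We have for $\epsilon/M \ge 8/m$ and $m \ge 2$,

$$E Z^{\varepsilon}_{\epsilon,M} \le 20 \sqrt{1 + 4A}\, M^{1-s} \epsilon^s \sqrt{\frac{\log(12m)}{n}}.$$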
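To see when the direct bound of Corollary 2.1 improves on the trivial bound, one can simply evaluate both right-hand sides. The sketch below does this under stated assumptions: the corollary is read as $20\sqrt{1+4A}\,M^{1-s}\epsilon^s\sqrt{\log(12m)/n}$, and the trivial bound is taken as $M$ times the Lemma 4.1 bound with $L = 2$ (the range of $\varepsilon_i\psi_k(x_i)$ under Assumption A); both function names are hypothetical.

```python
import math

def trivial_bound(M, m, n, L=2.0):
    """M * 2 L sqrt(log(3m)/n); L = 2 is an assumption of this sketch."""
    return M * 2.0 * L * math.sqrt(math.log(3 * m) / n)

def direct_bound(eps, M, m, n, A, s):
    """Right-hand side of Corollary 2.1 as read above."""
    if eps / M < 8.0 / m or m < 2:
        raise ValueError("Corollary 2.1 requires eps/M >= 8/m and m >= 2")
    return (20.0 * math.sqrt(1.0 + 4.0 * A) * M ** (1.0 - s) * eps ** s
            * math.sqrt(math.log(12 * m) / n))
```

Because of the factor $(\epsilon/M)^s$, the direct bound wins only when $\epsilon/M$ is small; for $\epsilon$ of the same order as $M$ the large constant 20 makes the trivial bound smaller.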






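Invoking symmetrization, contraction and concentration inequalities (see Section 4), we establish the following lemma. We present it in a form convenient for application in the proof of Theorem 3.1.

Lemma 2.1. Define for $\epsilon > 0$, $M > 0$, and $\epsilon/M \ge 8/m$, $m \ge 2$,

$$Z_{\epsilon,M} := \sup_{\|f_\theta - f_{\theta_n^*}\|_n \le \epsilon,\ I(\theta - \theta_n^*) \le M} \left| (P_n - P)(\gamma_{f_\theta} - \gamma_{f_{\theta_n^*}}) \right|.$$

Let

$$\lambda_{n,0} := 80 \sqrt{1 + 4A} \sqrt{\frac{\log(12m)}{n}}.$$

Then it holds for all $\sigma > 0$, that

$$P\left( Z_{\epsilon,M} \ge \lambda_{n,0}\, \epsilon^s M^{1-s} + \frac{\epsilon^2}{27\sigma^2} \right) \le \exp\left[ - \frac{n \epsilon^2}{2 \times (27\sigma^2)^2} \right].$$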



3. A non-asymptotic bound for the estimator 

The following theorem presents bounds along the lines of results in [l(| , [H| and 
[|| , but it is stated in a non-asymptotic form. It moreover formulates explicitly the 
dependence on the expected increments of the empirical process. 

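Theorem 3.1. Define for $\epsilon > 0$ and $M > 0$,

$$Z_{\epsilon,M} := \sup_{\|f_\theta - f_{\theta_n^*}\|_n \le \epsilon,\ I(\theta - \theta_n^*) \le M} \left| (P_n - P)(\gamma_{f_\theta} - \gamma_{f_{\theta_n^*}}) \right|.$$

Let $\lambda_{n,0}$ be such that for all $8/m \le \epsilon/M \le 1$, we have

$$E Z_{\epsilon,M} \le \lambda_{n,0}\, \epsilon^s M^{1-s}. \tag{6}$$

Let $c \ge 3$ be some constant. Define

$$M_n := 2^{\frac{2-s}{2(1-s)}} (27)^{-\frac{s}{2(1-s)}} c^{\frac{1}{1-s}} I(\theta_n^*), \qquad \sigma_n^2 := \sigma^2(M_n),$$

and

$$\epsilon_n := \sqrt{54}\, \sigma_n^{\frac{2}{2-s}} c^{\frac{1}{2-s}} \lambda_{n,0}^{\frac{1}{2-s}} I^{\frac{1-s}{2-s}}(\theta_n^*) \vee 27 \sigma_n^2 \lambda_{n,0}.$$

Assume that
(7)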



, 27\ 



ol A„ n V 8 / 



2 2(1— a) 

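Then for $\lambda_n := c\, \sigma_n^s \lambda_{n,0}$, with probability at least

$$1 - \exp\left[ - \frac{n \lambda_{n,0}^{\frac{2}{2-s}} c^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}}(\theta_n^*)}{27\, \sigma_n^{\frac{4(1-s)}{2-s}}} \right],$$

we have that

$$\|f_{\hat\theta_n} - f_{\theta_n^*}\|_n \le \epsilon_n$$

and

$$I(\hat\theta_n - \theta_n^*) \le M_n.$$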
Let us formulate the asymptotic implication of Theorem 3.1 in a corollary. For positive sequences $\{a_n\}$ and $\{b_n\}$, we use the notation

$$a_n \asymp b_n,$$

when

$$0 < \liminf_{n \to \infty} \frac{a_n}{b_n} \le \limsup_{n \to \infty} \frac{a_n}{b_n} < \infty.$$

The corollary yields e.g. the rate $\epsilon_n \asymp n^{-1/3}$ for the case where the penalty represents the total variation of a function $f$ on $\{x_1, \ldots, x_n\} \subset \mathbb{R}$ (in which case $s = 1/2$).

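Corollary 3.1. Suppose that $A$ and $s$ do not depend on $n$, and that $I(\theta_n^*) \asymp 1$ and $\sigma^2(M_n) \asymp 1$ for all $M_n \asymp 1$. By (4), we may take $\lambda_n \asymp 1/\sqrt{n}$, in which case, with probability $1 - \exp[-d_n]$, it holds that $\|f_{\hat\theta_n} - f_{\theta_n^*}\|_n \le \epsilon_n$, and $I(\hat\theta_n - \theta_n^*) \le M_n$, with

$$\epsilon_n \asymp n^{-\frac{1}{2(2-s)}}, \quad M_n \asymp 1, \quad d_n \asymp n\epsilon_n^2 \asymp n^{\frac{1-s}{2-s}}.$$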
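The exponents in the rates of Corollary 3.1 are simple functions of $s$ and can be computed exactly. The helper below is a small sketch, assuming the rates $\epsilon_n \asymp n^{-1/(2(2-s))}$ and $d_n \asymp n\epsilon_n^2$; the function name is hypothetical.

```python
from fractions import Fraction

def rate_exponents(s):
    """Return (a, b) such that eps_n ~ n**a and d_n ~ n**b for a given s in (0, 1)."""
    s = Fraction(s)
    eps_exp = Fraction(-1) / (2 * (2 - s))  # eps_n ~ n^{-1/(2(2-s))}
    d_exp = 1 + 2 * eps_exp                 # d_n ~ n * eps_n^2 = n^{(1-s)/(2-s)}
    return eps_exp, d_exp
```

For the total variation example, `rate_exponents(Fraction(1, 2))` gives the exponent $-1/3$ for $\epsilon_n$, matching the rate $n^{-1/3}$ quoted before the corollary.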



4. Proofs 

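4.1. Preliminaries

Theorem 4.1 (Concentration theorem [?]). Let $Z_1, \ldots, Z_n$ be independent random variables with values in some space $\mathcal{Z}$ and let $\Gamma$ be a class of real-valued functions on $\mathcal{Z}$, satisfying

$$a_{i,\gamma} \le \gamma(Z_i) \le b_{i,\gamma},$$

for some real numbers $a_{i,\gamma}$ and $b_{i,\gamma}$ and for all $1 \le i \le n$ and $\gamma \in \Gamma$. Define

$$L^2 := \sup_{\gamma \in \Gamma} \sum_{i=1}^n (b_{i,\gamma} - a_{i,\gamma})^2 / n,$$

and

$$Z := \sup_{\gamma \in \Gamma} \left| \frac{1}{n} \sum_{i=1}^n \left( \gamma(Z_i) - E\gamma(Z_i) \right) \right|.$$

Then for any positive $z$,

$$P(Z \ge EZ + z) \le \exp\left[ - \frac{n z^2}{2 L^2} \right].$$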



The Concentration theorem involves the expectation of the supremum of the 
empirical process. We derive bounds for it using symmetrization and contraction. 
Let us recall these techniques here. 

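Theorem 4.2 (Symmetrization theorem [13]). Let $Z_1, \ldots, Z_n$ be independent random variables with values in $\mathcal{Z}$, and let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher sequence independent of $Z_1, \ldots, Z_n$. Let $\Gamma$ be a class of real-valued functions on $\mathcal{Z}$. Then

$$E\left( \sup_{\gamma \in \Gamma} \left| \sum_{i=1}^n \{ \gamma(Z_i) - E\gamma(Z_i) \} \right| \right) \le 2 E\left( \sup_{\gamma \in \Gamma} \left| \sum_{i=1}^n \varepsilon_i \gamma(Z_i) \right| \right).$$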
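Theorem 4.3 (Contraction theorem [5]). Let $z_1, \ldots, z_n$ be non-random elements of some space $\mathcal{Z}$ and let $\mathcal{F}$ be a class of real-valued functions on $\mathcal{Z}$. Consider Lipschitz functions $\gamma_i : \mathbb{R} \to \mathbb{R}$, i.e.

$$|\gamma_i(s) - \gamma_i(\tilde s)| \le |s - \tilde s|, \quad \forall\, s, \tilde s \in \mathbb{R}.$$

Let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher sequence. Then for any function $f^* : \mathcal{Z} \to \mathbb{R}$, we have

$$E\left( \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \varepsilon_i \{ \gamma_i(f(z_i)) - \gamma_i(f^*(z_i)) \} \right| \right) \le 2 E\left( \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^n \varepsilon_i ( f(z_i) - f^*(z_i) ) \right| \right).$$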

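We now consider the case where $\Gamma$ is a finite set of functions.

Lemma 4.1. Let $Z_1, \ldots, Z_n$ be independent $\mathcal{Z}$-valued random variables, and $\gamma_1, \ldots, \gamma_m$ be real-valued functions on $\mathcal{Z}$, satisfying

$$a_{i,k} \le \gamma_k(Z_i) \le b_{i,k},$$

for some real numbers $a_{i,k}$ and $b_{i,k}$ and for all $1 \le i \le n$ and $1 \le k \le m$. Define

$$L^2 := \max_{1 \le k \le m} \sum_{i=1}^n (b_{i,k} - a_{i,k})^2 / n.$$

Then

$$E\left( \max_{1 \le k \le m} \left| \frac{1}{n} \sum_{i=1}^n \{ \gamma_k(Z_i) - E\gamma_k(Z_i) \} \right| \right) \le 2L \sqrt{\frac{\log(3m)}{n}}.$$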



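Proof. The proof uses standard arguments, as treated in e.g. [13]. Let us write for $1 \le k \le m$,

$$\bar\gamma_k := \frac{1}{n} \sum_{i=1}^n \{ \gamma_k(Z_i) - E\gamma_k(Z_i) \}.$$

By Hoeffding's inequality, for all $z \ge 0$,

$$P(|\bar\gamma_k| \ge z) \le 2 \exp\left[ - \frac{n z^2}{2 L^2} \right].$$

Hence,

$$E \exp\left[ \frac{n \bar\gamma_k^2}{4 L^2} \right] = 1 + \int_1^\infty P\left( |\bar\gamma_k| \ge \sqrt{\frac{4L^2}{n} \log t} \right) dt \le 1 + 2 \int_1^\infty \frac{1}{t^2}\, dt = 3.$$

Thus

$$E\left( \max_{1 \le k \le m} |\bar\gamma_k| \right) = \frac{2L}{\sqrt{n}} E \sqrt{ \max_{1 \le k \le m} \log \exp\left[ \frac{n \bar\gamma_k^2}{4L^2} \right] } \le \frac{2L}{\sqrt{n}} \sqrt{ \log E \max_{1 \le k \le m} \exp\left[ \frac{n \bar\gamma_k^2}{4L^2} \right] } \le 2L \sqrt{\frac{\log(3m)}{n}}.$$

□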
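A quick Monte Carlo experiment can illustrate the maximal inequality of Lemma 4.1 for Rademacher averages. The sketch below assumes $\gamma_k(Z_i) = \varepsilon_i \psi_k(x_i)$, so that $\gamma_k(Z_i)$ lies in $[-|\psi_k(x_i)|, |\psi_k(x_i)|]$ and has mean zero; the function names and the randomly generated values of $\psi_k$ in the usage are hypothetical.

```python
import math
import random

def lemma_bound(psi, n):
    """2 L sqrt(log(3m)/n) with L^2 = max_k sum_i (b_ik - a_ik)^2 / n,
    where b_ik - a_ik = 2 |psi_k(x_i)| for Rademacher-weighted values."""
    m = len(psi)
    L2 = max(sum((2.0 * abs(v)) ** 2 for v in row) / n for row in psi)
    return 2.0 * math.sqrt(L2) * math.sqrt(math.log(3 * m) / n)

def simulated_mean_max(psi, n, reps=200, seed=0):
    """Monte Carlo estimate of E max_k |(1/n) sum_i eps_i psi_k(x_i)|."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(abs(sum(e * v for e, v in zip(signs, row))) / n for row in psi)
    return total / reps
```

In such experiments the simulated mean typically sits well below the bound, which is consistent with the remark in the Introduction that the constants are not optimized.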
4.2. Proofs of the results in Section 2

Proof of Theorem 2.1. Let us define, for $k = 1, \ldots, m$,

$$\xi_k := \frac{1}{n} \sum_{i=1}^n \psi_k(x_i) \varepsilon_i.$$

We have

$$\frac{1}{n} \sum_{i=1}^n f_\theta(x_i) \varepsilon_i = \sum_{k=1}^m \theta_k \xi_k.$$

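Partition $\{1, \ldots, m\}$ into $N := N(\epsilon^s, \Psi)$ sets $V_j$, $j = 1, \ldots, N$, such that

$$\|\psi_k - \psi_l\|_n \le 2\epsilon^s, \quad \forall\, k, l \in V_j.$$

We can write

$$\frac{1}{n} \sum_{i=1}^n f_\theta(x_i) \varepsilon_i = \sum_{j=1}^N \alpha_j \sum_{k \in V_j} p_{j,k} \xi_k,$$

where

$$\alpha_j = \alpha_j(\theta) := \sum_{k \in V_j} \theta_k, \qquad p_{j,k} = p_{j,k}(\theta) := \frac{\theta_k}{\alpha_j}.$$

Set for $j = 1, \ldots, N$,

$$n_j = n_j(\alpha) := 1 + \left\lfloor \frac{\alpha_j}{\epsilon^{2(1-s)}} \right\rfloor.$$

Choose $\pi_{t,j} = \pi_{t,j}(\theta)$, $t = 1, \ldots, n_j$, $j = 1, \ldots, N$, independent random variables, independent of $\varepsilon_1, \ldots, \varepsilon_n$, with distribution

$$P(\pi_{t,j} = k) = p_{j,k}, \quad k \in V_j,\ j = 1, \ldots, N.$$

Let $\bar\psi_j = \bar\psi_j(\theta) := \sum_{t=1}^{n_j} \psi_{\pi_{t,j}} / n_j$ and $\bar\xi_j = \bar\xi_j(\theta) := \sum_{t=1}^{n_j} \xi_{\pi_{t,j}} / n_j$.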
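The random-selection step can be illustrated in isolation: averaging $n_j$ functions drawn i.i.d. according to the weights $p_{j,k}$ approximates the convex combination $\sum_k p_{j,k}\psi_k$, with mean squared error in $\|\cdot\|_n$ of order (squared diameter)$/n_j$. The following is a minimal simulation sketch for a single group; functions are represented by their values at the design points, and all names are hypothetical.

```python
import random

def empirical_norm_sq(values):
    """Squared L2(Q_n) norm of a function given by its values at x_1, ..., x_n."""
    return sum(v * v for v in values) / len(values)

def sample_average(functions, probs, n_draws, rng):
    """Average of n_draws functions drawn i.i.d. with P(pi = k) = probs[k]."""
    n = len(functions[0])
    avg = [0.0] * n
    for k in rng.choices(range(len(functions)), weights=probs, k=n_draws):
        for i, v in enumerate(functions[k]):
            avg[i] += v / n_draws
    return avg

def mean_sq_error(functions, probs, n_draws, reps=500, seed=0):
    """Monte Carlo estimate of E || sample average - convex combination ||_n^2."""
    rng = random.Random(seed)
    n = len(functions[0])
    target = [sum(p * f[i] for p, f in zip(probs, functions)) for i in range(n)]
    total = 0.0
    for _ in range(reps):
        avg = sample_average(functions, probs, n_draws, rng)
        total += empirical_norm_sq([a - b for a, b in zip(avg, target)])
    return total / reps
```

Increasing `n_draws` reduces the error at rate `1/n_draws`, which is why $n_j$ is chosen proportional to $\alpha_j/\epsilon^{2(1-s)}$: groups with larger weight receive more draws.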

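We will choose a realization $\{(\psi_j^*, \xi_j^*) = (\psi_j^*(\theta), \xi_j^*(\theta))\}_{j=1}^N$ of $\{(\bar\psi_j, \bar\xi_j)\}_{j=1}^N$ depending on $\{\varepsilon_i\}_{i=1}^n$, satisfying appropriate conditions (namely, (9) and (10) below). We may then write

$$\left| \sum_{k=1}^m \theta_k \xi_k \right| \le \left| \sum_{j=1}^N \alpha_j \xi_j^* \right| + \left| \sum_{j=1}^N \alpha_j \Big( \xi_j^* - \sum_{k \in V_j} p_{j,k} \xi_k \Big) \right|.$$

Consider now

$$\sum_{j=1}^N \alpha_j \xi_j^*.$$

Let $\mathcal{A}_N := \{ \alpha : \sum_{j=1}^N \alpha_j = 1,\ \alpha_j \ge 0 \}$. Endow $\mathcal{A}_N$ with the $\ell_1$ metric. The $\epsilon$-covering number $D(\epsilon)$ of $\mathcal{A}_N$ satisfies the bound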


D(e)< 



N 



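Let $\mathcal{A}_\epsilon$ be a maximal $\epsilon$-covering set of $\mathcal{A}_N$. For all $\alpha \in \mathcal{A}_N$ there is an $\alpha' \in \mathcal{A}_\epsilon$ such that $\sum_{j=1}^N |\alpha_j - \alpha_j'| \le \epsilon$.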
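We now write

$$\left| \sum_{k=1}^m \theta_k \xi_k \right| \le \left| \sum_{j=1}^N (\alpha_j - \alpha_j') \xi_j^* \right| + \left| \sum_{j=1}^N \alpha_j (\xi_j^* - e_j) \right| + \left| \sum_{j=1}^N \alpha_j' \xi_j^* \right| =: \mathrm{i}(\theta) + \mathrm{ii}(\theta) + \mathrm{iii}(\theta),$$

where $e_j := \sum_{k \in V_j} p_{j,k} \xi_k$.

Let $\Pi$ be the set of possible values of the vector $\{\pi_{t,j} : t = 1, \ldots, n_j,\ j = 1, \ldots, N\}$, as $\theta$ varies. Clearly,

$$\mathrm{i}(\theta) \le \epsilon \max_{\Pi} \max_j |\bar\xi_j|,$$

where we take the maximum over all possible realizations of $\{\bar\xi_j\}_{j=1}^N$ over all $\theta$.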
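For each $t$ and $j$, $\pi_{t,j}$ takes its values in $\{1, \ldots, m\}$, that is, it takes at most $m$ values. We have

$$\sum_{j=1}^N n_j \le N + \sum_{j=1}^N \frac{\alpha_j}{\epsilon^{2(1-s)}} \le A\epsilon^{-2(1-s)} + \epsilon^{-2(1-s)} = (1 + A)\epsilon^{-2(1-s)} \le K + 1,$$

where $K$ is the integer

$$K := \lfloor (1 + A)\epsilon^{-2(1-s)} \rfloor.$$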

The number of integer sequences $\{n_j\}_{j=1}^N$ with $\sum_{j=1}^N n_j \le K + 1$ is equal to

/iV + K + 2\ 2 Ar +K+2 ^ 4 x 2 (l+2A) e - 2 ( 1 - = )_ 

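So the cardinality $|\Pi|$ of $\Pi$ satisfies

$$|\Pi| \le 4 \times 2^{(1+2A)\epsilon^{-2(1-s)}} \times m^{(1+A)\epsilon^{-2(1-s)}} \le (2m)^{(1+2A)\epsilon^{-2(1-s)}},$$

since $A \ge 1$ and $m \ge 4$.

Now, since $\|\psi_k\|_\infty \le 1$ for all $k$, we know that for any convex combination $\sum_k p_k \xi_k$, one has $E|\sum_k p_k \xi_k|^2 \le 1/n$. Hence $E\bar\xi_j^2 \le 1/n$ for any fixed value of $\{\pi_{t,j}\}$ in $\Pi$, and thus, by Lemma 4.1,

$$\epsilon\, E \max_{\Pi} \max_j |\bar\xi_j| \le 2\epsilon \sqrt{1 + 2A}\, \epsilon^{-(1-s)} \sqrt{\frac{\log(6m)}{n}} = 2\sqrt{1 + 2A}\, \epsilon^s \sqrt{\frac{\log(6m)}{n}}. \tag{8}$$

We now turn to $\mathrm{ii}(\theta)$.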

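By construction, for $i = 1, \ldots, n$, $t = 1, \ldots, n_j$, $j = 1, \ldots, N$,

$$E\psi_{\pi_{t,j}}(x_i) = \sum_{k \in V_j} p_{j,k} \psi_k(x_i) := g_j(x_i),$$

and hence

$$E(\psi_{\pi_{t,j}}(x_i) - g_j(x_i))^2 \le \max_{k,l \in V_j} (\psi_k(x_i) - \psi_l(x_i))^2.$$

Thus

$$E(\bar\psi_j(x_i) - g_j(x_i))^2 \le \max_{k,l \in V_j} (\psi_k(x_i) - \psi_l(x_i))^2 / n_j,$$

and so

$$E\|\bar\psi_j - g_j\|_n^2 \le \max_{k,l \in V_j} \|\psi_k - \psi_l\|_n^2 / n_j \le (2\epsilon^s)^2 / n_j = 4\epsilon^{2s} / n_j.$$

Therefore

$$E\left\| \sum_{j=1}^N \alpha_j (\bar\psi_j - g_j) \right\|_n^2 = \sum_{j=1}^N \alpha_j^2 E\|\bar\psi_j - g_j\|_n^2 \le 4\epsilon^{2s} \sum_{j=1}^N \frac{\alpha_j^2}{n_j} \le 4\epsilon^{2s} \sum_{j=1}^N \frac{\alpha_j^2 \epsilon^{2(1-s)}}{\alpha_j} \le 4\epsilon^2.$$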

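Let $E_\varepsilon$ denote conditional expectation given $\{\varepsilon_i\}_{i=1}^n$. Again, by construction

$$E_\varepsilon \xi_{\pi_{t,j}} = \sum_{k \in V_j} p_{j,k} \xi_k =: e_j,$$

and hence

$$E_\varepsilon (\xi_{\pi_{t,j}} - e_j)^2 \le \max_{k,l \in V_j} (\xi_k - \xi_l)^2.$$

Thus

$$E_\varepsilon (\bar\xi_j - e_j)^2 \le \max_{k,l \in V_j} (\xi_k - \xi_l)^2 / n_j.$$

So we obtain

$$E_\varepsilon \left| \sum_{j=1}^N \alpha_j (\bar\xi_j - e_j) \right| \le \sum_{j=1}^N \alpha_j E_\varepsilon |\bar\xi_j - e_j| \le \sum_{j=1}^N \alpha_j \max_{k,l \in V_j} \frac{|\xi_k - \xi_l|}{\sqrt{n_j}}$$

$$\le \sum_{j=1}^N \frac{\alpha_j \epsilon^{1-s}}{\sqrt{\alpha_j}} \max_{k,l \in V_j} |\xi_k - \xi_l| = \sum_{j=1}^N \sqrt{\alpha_j}\, \epsilon^{1-s} \max_{k,l \in V_j} |\xi_k - \xi_l|$$

$$\le \sqrt{N} \epsilon^{1-s} \max_j \max_{k,l \in V_j} |\xi_k - \xi_l| \le \sqrt{A} \max_j \max_{k,l \in V_j} |\xi_k - \xi_l|.$$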

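It follows that, given $\{\varepsilon_i\}_{i=1}^n$, there exists a realization $\{(\psi_j^*, \xi_j^*) = (\psi_j^*(\theta), \xi_j^*(\theta))\}_{j=1}^N$ of $\{(\bar\psi_j, \bar\xi_j)\}_{j=1}^N$ such that

$$\left\| \sum_{j=1}^N \alpha_j (\psi_j^* - g_j) \right\|_n \le 4\epsilon \tag{9}$$

as well as

$$\left| \sum_{j=1}^N \alpha_j (\xi_j^* - e_j) \right| \le 2\sqrt{A} \max_j \max_{k,l \in V_j} |\xi_k - \xi_l|. \tag{10}$$

Thus we have

$$\mathrm{ii}(\theta) \le 2\sqrt{A} \max_j \max_{k,l \in V_j} |\xi_k - \xi_l|.$$

Since $E|\xi_k - \xi_l|^2 \le (2\epsilon^s)^2/n$ for all $k, l \in V_j$ and all $j$, we have by Lemma 4.1

$$2\sqrt{A}\, E \max_j \max_{k,l \in V_j} |\xi_k - \xi_l| \le 6\sqrt{A}\, \epsilon^s \sqrt{\frac{\log(6m)}{n}}. \tag{11}$$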
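Finally, consider $\mathrm{iii}(\theta)$. We know that

$$\|f_\theta\|_n = \left\| \sum_{j=1}^N \alpha_j g_j \right\|_n \le \epsilon.$$

Moreover, we have shown in (9) that

$$\left\| \sum_{j=1}^N \alpha_j (\psi_j^* - g_j) \right\|_n \le 4\epsilon.$$

Also

$$\left\| \sum_{j=1}^N (\alpha_j - \alpha_j') \psi_j^* \right\|_n \le \sum_{j=1}^N |\alpha_j - \alpha_j'|\, \|\psi_j^*\|_n \le \epsilon,$$

since $\|\psi_j^*\|_\infty \le 1$ for all $j$. Thus

$$\left\| \sum_{j=1}^N \alpha_j' \psi_j^* \right\|_n \le \left\| \sum_{j=1}^N (\alpha_j' - \alpha_j) \psi_j^* \right\|_n + \left\| \sum_{j=1}^N \alpha_j (\psi_j^* - g_j) \right\|_n + \left\| \sum_{j=1}^N \alpha_j g_j \right\|_n \le 6\epsilon.$$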



The total number of functions of the form $\sum_{j=1}^N \alpha_j' \xi_j^*$ is bounded by



/v 



x ini < 



x (2m 

-2(l-») 



,(l+2A)e 



-2(l-») 



< (2m)( 1+2A ' £ 

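since we assume $\epsilon \ge 16/m$ and $A \ge 1$. Hence, by Lemma 4.1,

$$E \max_{\alpha' \in \mathcal{A}_\epsilon} \max_{\Pi} \left| \sum_{j=1}^N \alpha_j' \xi_j^* \right| \le 12\sqrt{1 + 2A}\, \epsilon^s \sqrt{\frac{\log(6m)}{n}}. \tag{12}$$

We conclude from (8), (11), and (12), that

$$E\left( \max_\theta \left| \sum_{j=1}^N \alpha_j(\theta) e_j(\theta) \right| \right) \le 2\sqrt{1 + 2A}\, \epsilon^s \sqrt{\frac{\log(6m)}{n}} + 6\sqrt{A}\, \epsilon^s \sqrt{\frac{\log(6m)}{n}} + 12\sqrt{1 + 2A}\, \epsilon^s \sqrt{\frac{\log(6m)}{n}} \le 20\sqrt{1 + 2A}\, \epsilon^s \sqrt{\frac{\log(6m)}{n}}.$$

□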



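Proof of Lemma 2.1. Let

$$Z^{\varepsilon}_{\epsilon,M} := \sup_{\|f_\theta\|_n \le \epsilon,\ I(\theta) \le M} \left| \frac{1}{n} \sum_{i=1}^n f_\theta(x_i) \varepsilon_i \right|$$

denote the symmetrized process. Clearly, $\{f_\theta = \sum_{k=1}^m \theta_k \psi_k : I(\theta) \le 1\}$ is the convex hull of $\tilde\Psi := \{\pm\psi_k\}_{k=1}^m$. Moreover, we have

$$N(\epsilon, \tilde\Psi) \le 2N(\epsilon, \Psi).$$

Now, apply Theorem 2.1 to $\tilde\Psi$, and use a rescaling argument, to see that

$$E Z^{\varepsilon}_{\epsilon,M} \le 20\sqrt{1 + 4A}\, \epsilon^s M^{1-s} \sqrt{\frac{\log(12m)}{n}}.$$

Then from Theorem 4.2 and Theorem 4.3, we know that

$$E Z_{\epsilon,M} \le 4 E Z^{\varepsilon}_{\epsilon,M}.$$

The result now follows by applying Theorem 4.1. □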
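4.3. Proofs of the results in Section 3

The proof of Theorem 3.1 depends on the following simple convexity trick.

Lemma 4.2. Let $\epsilon > 0$ and $M > 0$. Define $\tilde f_n = t \hat f_n + (1 - t) f_n^*$ with

$$t := (1 + \|\hat f_n - f_n^*\|_n/\epsilon + I(\hat f_n - f_n^*)/M)^{-1},$$

and with $\hat f_n := f_{\hat\theta_n}$ and $f_n^* := f_{\theta_n^*}$. When it holds that

$$\|\tilde f_n - f_n^*\|_n \le \frac{\epsilon}{3}, \quad I(\tilde f_n - f_n^*) \le \frac{M}{3},$$

then

$$\|\hat f_n - f_n^*\|_n \le \epsilon, \quad \text{and} \quad I(\hat f_n - f_n^*) \le M.$$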

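Proof. We have

$$\tilde f_n - f_n^* = t(\hat f_n - f_n^*),$$

so $\|\tilde f_n - f_n^*\|_n \le \epsilon/3$ implies

$$\|\hat f_n - f_n^*\|_n \le \frac{\epsilon}{3t} = (1 + \|\hat f_n - f_n^*\|_n/\epsilon + I(\hat f_n - f_n^*)/M)\frac{\epsilon}{3}.$$

So then

$$\|\hat f_n - f_n^*\|_n \le \frac{\epsilon}{2} + \frac{\epsilon}{2M} I(\hat f_n - f_n^*). \tag{13}$$

Similarly, $I(\tilde f_n - f_n^*) \le M/3$ implies

$$I(\hat f_n - f_n^*) \le \frac{M}{2} + \frac{M}{2\epsilon} \|\hat f_n - f_n^*\|_n. \tag{14}$$

Inserting (14) into (13) gives

$$\|\hat f_n - f_n^*\|_n \le \frac{3\epsilon}{4} + \frac{1}{4} \|\hat f_n - f_n^*\|_n,$$

i.e., $\|\hat f_n - f_n^*\|_n \le \epsilon$. Similarly, inserting (13) into (14) gives $I(\hat f_n - f_n^*) \le M$. □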
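Since $\tilde f_n - f_n^* = t(\hat f_n - f_n^*)$, the convexity trick of Lemma 4.2 depends only on the two numbers $x = \|\hat f_n - f_n^*\|_n$ and $y = I(\hat f_n - f_n^*)$. The sketch below (with a hypothetical function name) evaluates the premise and the conclusion of the lemma for given $x$, $y$, $\epsilon$ and $M$, so that the implication can be checked on a grid.

```python
def convexity_trick(x, y, eps, M):
    """x = ||f_hat - f*||_n and y = I(f_hat - f*). Returns (t, premise, conclusion),
    where the premise is the pair of bounds on the shrunken difference t*(f_hat - f*)
    and the conclusion is the pair of bounds on f_hat - f* itself."""
    t = 1.0 / (1.0 + x / eps + y / M)
    premise = (t * x <= eps / 3.0) and (t * y <= M / 3.0)
    conclusion = (x <= eps) and (y <= M)
    return t, premise, conclusion
```

The point of the construction is that $\tilde f_n$ always lies in the neighbourhood $\{\|\cdot\|_n \le \epsilon,\ I(\cdot) \le M\}$ of $f_n^*$ (because $tx \le \epsilon$ and $ty \le M$ by definition of $t$), so only a single neighbourhood has to be controlled, in place of the sequence of shells used by the peeling device.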



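Proof of Theorem 3.1. Note first that, by the definition of $M_n$, $\epsilon_n$ and $\lambda_n$, it holds that

$$\lambda_{n,0}\, \epsilon_n^s M_n^{1-s} \le \frac{\epsilon_n^2}{27\sigma_n^2}, \tag{15}$$

and also

$$(27)^{\frac{s}{2-s}} c^{-\frac{2}{2-s}} \lambda_n^{\frac{2}{2-s}} M_n^{\frac{2(1-s)}{2-s}} \le \frac{\epsilon_n^2}{27\sigma_n^2}. \tag{16}$$

Define

$$\tilde\theta_n = t\hat\theta_n + (1 - t)\theta_n^*,$$

where

$$t := (1 + \|f_{\hat\theta_n} - f_{\theta_n^*}\|_n/\epsilon_n + I(f_{\hat\theta_n} - f_{\theta_n^*})/M_n)^{-1}.$$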

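We know that by convexity, and since $\hat\theta_n$ minimizes the penalized empirical risk, we have

$$P_n\gamma_{f_{\tilde\theta_n}} + \lambda_n^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}}(\tilde\theta_n) \le t\left( P_n\gamma_{f_{\hat\theta_n}} + \lambda_n^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}}(\hat\theta_n) \right) + (1 - t)\left( P_n\gamma_{f_{\theta_n^*}} + \lambda_n^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}}(\theta_n^*) \right) \le P_n\gamma_{f_{\theta_n^*}} + \lambda_n^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}}(\theta_n^*).$$

This can be rewritten as

$$P(\gamma_{f_{\tilde\theta_n}} - \gamma_{f_{\theta_n^*}}) + \lambda_n^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}}(\tilde\theta_n) \le -(P_n - P)(\gamma_{f_{\tilde\theta_n}} - \gamma_{f_{\theta_n^*}}) + \lambda_n^{\frac{2}{2-s}} I^{\frac{2(1-s)}{2-s}}(\theta_n^*).$$

Since $I(f_{\tilde\theta_n} - f_{\theta_n^*}) \le M_n$, and $\|\psi_k\|_\infty \le 1$ (by Assumption A), we have that $\|f_{\tilde\theta_n} - f_{\theta_n^*}\|_\infty \le M_n$. Hence, by Assumption M,

$$P(\gamma_{f_{\tilde\theta_n}} - \gamma_{f_{\theta_n^*}}) \ge \|f_{\tilde\theta_n} - f_{\theta_n^*}\|_n^2 / \sigma_n^2.$$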

We thus obtain 

\\h-f*\\l + ^ lS ^ {§ r) 

°n 



<-(p„-p)( 7/ _ 7/e , ) + 2A^r 



;)■ 



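Now, $\|f_{\tilde\theta_n} - f_{\theta_n^*}\|_n \le \epsilon_n$ and $I(\tilde\theta_n - \theta_n^*) \le M_n$. Moreover $\epsilon_n/M_n \le 1$ and in view of (7), $\epsilon_n/M_n \ge 8/m$. Therefore, we have by (6) and Theorem 4.1, with probability at least

$$1 - \exp\left[ - \frac{n\epsilon_n^2}{2 \times (27\sigma_n^2)^2} \right],$$

that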



II fit - fo* II* -2- 2(1- S ) - 



< A„n<Mi- s + 2Ar s r 



2(l~a) ,„ 



27(t2 

2(1-°' 



< Aj^oe* M„ s + (27) — c- — A^- 3 M„ 2 - 3 + 



27d2 
where in the last step, we invoked (15) and (16).
It follows that

$$\|f_{\tilde\theta_n} - f_{\theta_n^*}\|_n \le \frac{\epsilon_n}{3},$$

and also that

$$I(\tilde\theta_n - \theta_n^*) \le \frac{M_n}{3},$$

since $c \ge 3$.

To conclude the proof, apply Lemma 4.2. □